We present a compact matrix formulation of the modularity, a commonly used quality measure for
the community division in a network. Using this formulation we calculate the density of modularities, 
a statistical measure of the probability of finding a particular modularity for a random but valid 
community division into C communities. We present our results for some well-known and some 
artificial networks, and we conclude that the general features of the modularity density are quite 
similar for the different networks. From a simple model of the modularity we conclude that all 
connected networks must show similar shapes of their modularity densities. The general features 
of this density may give valuable information in the search for good optimization schemes of the 
modularity. 



I. INTRODUCTION 

The nodes of a network can be grouped into communi- 
ties which are loosely defined as groups of nodes that are 
more "related" to each other in some fashion than they 
are related to the rest of the network. Such a community 
division can reveal important structures of the network. 
In a recent study, for instance, Wilkinson and Huberman [1]
introduced a method to create a network of gene
co-occurrences from the literature and interpret its com- 
munities as groups of genes related to each other by their 
function. Since some of the genes in these communities 
are not known to be related to the community's func- 
tion, this method possibly aids in identifying unknown 
relationships of this sort. Massen and Doye [2] used a
community analysis on a potential energy landscape to
identify transition states of small Lennard-Jones clusters.
Networks have also been used very successfully to
simulate dynamics in various systems. By modeling a
community structure of individuals using a contact network
model, Meyers et al. [3] predicted the dynamics of
a SARS outbreak.

Many different approaches have been used to identify
community structures in networks. To name a few of the more
recent methods: vertex similarity [4], vertex degree
gradient [5], resistor networks, the Potts Hamiltonian model,
and an information-theoretic approach. The most
popular methods appear to be those based on the network
modularity Q introduced by Newman and coworkers.
The advantage of the modularity
Q is that it is a well-defined number that gives the quality
of a particular community division in a network. It
is bounded, -1 < Q < 1, and is larger for divisions that
split the network into groups with many intra-edges and
few inter-edges between the groups.

A number of different strategies have been proposed 
for finding the optimal community division based on the
modularity. These methods can be broadly divided into 
two different classes. Path bound methods are agglomer- 
ative or divisive and either successively add or take away 
edges in the network so as to reduce the number of com- 
munities by merging existing communities (agglomera- 
tive) or to increase the number of communities by taking 
away edges and splitting existing communities (divisive). 
In both cases, the number of possible community divisions
depends on the previous steps in the algorithm, that is, on
the particular path that was taken in the space of all
possible community divisions. The resulting evolution of the
community structure is commonly called a dendrogram.
The different methods in this class differ in the way they 
identify the edges to be removed or added. Examples
are shortest-path betweenness, random-path betweenness [11],
and the greedy algorithm. All these
methods have in common that they follow a dendrogram
and attempt to identify the edges to be removed or added
by optimizing the resulting modularity change. The number
of communities changes by at most 1 in each step,
and only information from the previous step is used. The
quality of these methods is very sensitive to the strategy
employed for identifying the critical edges.

Methods in the second class are not path bound and try
to optimize Q directly without regard to a dendrogram.
Simulated annealing is a recent example of one
of these techniques, but other techniques, such as
genetic algorithms, are also possible.

Current results suggest that non-path-bound optimization
strategies outperform dendrogram-bound methods [17].
However, the number of possible ways of dividing
a network with N nodes into C communities is immense;
it is given by the Stirling number of the second kind [14].
Due to the discrete nature of how nodes are assigned
to communities, the modularity takes on a discrete set of
values. The possibility exists that several divisions are of
similarly high quality, or that Q is degenerate as a function
of the community division. All of these properties
make it difficult to optimize the modularity
in a non-path-bound way using standard optimization
techniques.
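To give a feeling for how quickly the number of divisions grows, the Stirling numbers of the second kind can be evaluated with the standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1). The following is a minimal sketch; the function name `stirling2` is illustrative and not taken from the original work. Note that it counts all partitions of the node set, including divisions whose communities are not connected.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labelled nodes into k non-empty groups."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Node n either joins one of the k existing groups or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


if __name__ == "__main__":
    # N = 34 is the size of the Zachary Karate Club network.
    for c in (2, 4, 8):
        print(c, stirling2(34, c))
```

For C = 2 the count reduces to the closed form 2^(N-1) - 1, which already exceeds 8 x 10^9 for N = 34.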

In this article we study the properties of the modular- 
ity Q in a statistical sense. Our aim is to gain a deeper 
understanding of the complexity of network community 
divisions in general, and the modularity in particular. By 
gaining knowledge of the modularity we believe that it 
may be possible to find faster and more accurate commu- 
nity division algorithms. We introduce a matrix algebra 
formalism to define a connected community division and 
obtain the modularity. We calculate a community divi- 
sion density in the modularity-community space where 
we show how the values of Q are distributed in terms 
of the number of communities in the division for several 
different networks. 

This article is organized as follows: In Section II we
describe the modularity and introduce a matrix representation
of it. In Section III we present and discuss our
results, and in Section IV we present our conclusions.

FIG. 1: (Color online) A set of modularity plots obtained
from random dendrogram walks in the Zachary Karate Club 
friendship network. On the x-axis is the number of commu- 
nities, C. 



II. THEORY 

A. Modularity and its Matrix Representation 

A network can be represented by its corresponding adjacency
matrix A. For a network with N nodes, this
matrix is of size N x N, where the element A_ij represents
the edge between nodes i and j. For unweighted
networks, A_ij = 1 if the edge exists and 0 otherwise.
For weighted networks, A_ij = w_ij, the weight associated
with this edge, and in the case of an undirected network,
A is symmetric. If we do not include self-edges,
the diagonal of A is zero. The underlying network can
be divided into C communities, which amounts to labeling
each node with one of C community labels. A
compact way of expressing a specific community division
is through the community matrix P, which we define as
a matrix of size C x N, with elements given by

$$
P_{ij} = \begin{cases} 1 & \text{node } j \text{ is a member of community } i \\ 0 & \text{otherwise.} \end{cases} \tag{1}
$$

Newman's assortative mixing matrix [9], e, can then be
expressed as 



$$
e = \frac{1}{\sum_{ij} A_{ij}} \, P A P^{T}, \tag{2}
$$

where P^T is the transpose of P. The modularity is given
by

$$
Q = \mathrm{Tr}(e) - \sum_{ij} (e^2)_{ij}. \tag{3}
$$



The larger the value of Q, the better the community division.
The modularity has the upper bound Q ≤ 1 - 1/C.
This has to be regarded as a
theoretical upper bound, however; in practice the attainable
maximum is lower.
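The matrix definitions of P, e, and Q translate almost directly into array code. The following is a minimal NumPy sketch, assuming the normalisation e = P A P^T / sum_ij A_ij; the function names and the toy network (two triangles joined by a single edge) are hypothetical and serve only as an illustration.

```python
import numpy as np


def community_matrix(labels, n_communities):
    """Build the C x N matrix P with P[i, j] = 1 if node j is in community i."""
    labels = np.asarray(labels)
    P = np.zeros((n_communities, labels.size))
    P[labels, np.arange(labels.size)] = 1.0
    return P


def modularity(A, P):
    """Return Q = Tr(e) - sum_ij (e^2)_ij with e = P A P^T / sum_ij A_ij."""
    e = P @ A @ P.T / A.sum()
    return np.trace(e) - (e @ e).sum()


# Hypothetical toy network: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0      # undirected, so A is symmetric

P = community_matrix([0, 0, 0, 1, 1, 1], 2)
Q = modularity(A, P)
print(Q)
```

For this division the value stays below the theoretical bound 1 - 1/C = 0.5, because one edge connects the two communities.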



B. Statistical Analysis of the Modularity 

The modularity can be interpreted as a function of the
community matrix P. In the space of all possible community
divisions, Q(P) defines a rugged and complicated
surface. In Fig. 1 we show several curves of Q(P) vs. C
obtained by randomly choosing a path through this space
along a dendrogram in the Zachary Karate Club network.
The path is chosen by starting with a diagonal N
x N matrix P, so that all nodes are in their own community.
Then, by summing two randomly selected rows
in P, we merge two of the communities. We can easily
check that the new community is connected by checking
the assortative mixing matrix e (the off-diagonal element
linking the two merged communities must be nonzero).
We can see in the figure
that the qualities of the various community divisions
depend strongly on the chosen paths and the success in
the previous steps.
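The random dendrogram walk described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not say how rejected (unconnected) pairs are handled, so the sketch samples uniformly among pairs of communities that share at least one edge. The function names and the 8-node ring used as test network are hypothetical.

```python
import numpy as np


def modularity(A, P):
    """Q = Tr(e) - sum_ij (e^2)_ij with e = P A P^T / sum_ij A_ij."""
    e = P @ A @ P.T / A.sum()
    return np.trace(e) - (e @ e).sum()


def random_dendrogram_walk(A, rng):
    """Merge connected communities at random from C = N down to C = 1.

    Returns a dict mapping the number of communities C to Q at that step.
    """
    n = A.shape[0]
    P = np.eye(n)                      # every node in its own community
    walk = {n: modularity(A, P)}
    while P.shape[0] > 1:
        e = P @ A @ P.T
        # Candidate pairs: communities joined by at least one edge.
        rows, cols = np.nonzero(np.triu(e, k=1))
        if rows.size == 0:             # disconnected network: nothing to merge
            break
        pick = rng.integers(rows.size)
        i, j = rows[pick], cols[pick]
        P[i] += P[j]                   # summing two rows merges the communities
        P = np.delete(P, j, axis=0)
        walk[P.shape[0]] = modularity(A, P)
    return walk


# Hypothetical test network: a ring of 8 nodes.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

walk = random_dendrogram_walk(A, np.random.default_rng(0))
for c in sorted(walk, reverse=True):
    print(c, round(walk[c], 4))
```

Repeating the walk with different seeds gives different Q(C) curves between the same two end points, the fully split network and the single community.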

In the following we will try to gain a more general
understanding of the structure of Q(P) for different networks.
We will not attempt to optimize the modularity
but rather map its structure in the space of possible community
divisions. In other words, we would like to
know how many community matrices P exist for a given
number of communities C in a given modularity interval
between Q and Q + δQ. If we start out with the completely
split network, i.e. C = N, and sample Q(P) along
a random dendrogram until we reach C = 1, we will find
that some values of Q_C are more likely to be found this
way than others. After sampling a large number of these
random dendrograms we can analyze the result as a frequency
distribution f(Q) vs. Q and C and get a density
of modularities, N(Q). We expect the following: For a
given number of communities, C, there will be a range
of possible values of Q_C. We know from previous studies
that it is difficult to find a division with a large modularity.
This implies that we are unlikely to find a large
value of Q with our random dendrograms and likely to
find some average Q.
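Turning a collection of sampled walks into a density of modularities amounts to a two-dimensional histogram over C and Q. The sketch below assumes that each walk is stored as a dictionary {C: Q}, that Q is binned on [-1, 1], and that each row of constant C is normalised to unit integral; these storage and normalisation choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np


def modularity_density(walks, n_nodes, bin_size=0.01):
    """Histogram Q for each C and normalise each C-row to unit integral.

    walks: iterable of dicts {C: Q}. Returns (density, bin_edges), where
    density[C - 1, b] is the estimated density in modularity bin b.
    """
    n_bins = int(round(2.0 / bin_size))
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    counts = np.zeros((n_nodes, n_bins))
    for walk in walks:
        for c, q in walk.items():
            b = np.searchsorted(edges, q, side="right") - 1
            b = int(np.clip(b, 0, n_bins - 1))   # keep Q = +1 in the last bin
            counts[c - 1, b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    density = np.divide(counts, totals * bin_size,
                        out=np.zeros_like(counts), where=totals > 0)
    return density, edges


# Two hand-made walks on a hypothetical 3-node network (C = 3 not sampled).
walks = [{1: 0.0, 2: 0.305}, {1: 0.0, 2: 0.125}]
density, edges = modularity_density(walks, n_nodes=3)
print(density.shape)
```

With this normalisation a narrow distribution at a given C shows up as a high peak, and a broad distribution as a low one.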



III. RESULTS 
A. Examples of modularity densities 

1. Real networks 

In Fig. 1 we present plots of the modularity as a function
of C along random dendrograms in the Zachary
Karate Club network. The modularities in these examples
are quite different. By performing a large set of
such dendrogram walks, where we save each modularity
plot, we obtain a statistical image of the modularity.

In Fig. 2 we show the modularity density for the
Zachary Karate Club network where we have calcu- 
lated 100,000 modularity plots from random dendrogram 
walks. We find that the modularity density is not uni- 
formly distributed in the modularity-communities space 
and has a strong peak for large values of C. This peak 
decreases rapidly as the number of communities is de- 
creased. For small values of C the density is low and 
spread over a large range of modularities. By construc- 
tion, the integral over modularity for constant number 
of communities is always the same. Consequently, if the 
peak is very low, the range over which the modularity 
density is spread out will be large and it thus seems most 
probable to find the maximum Q within this regime. 

We notice that although the density is very low in the 
case of a small number of communities the peak does not 
disappear and the top of the peak outlines a curved shape 
in the modularity-communities space. The position of 
this peak determines the most probable Q(C) relation
for the network. 

In Fig. 3 we show the modularity density surface for
the network of simultaneous appearance on stage for the
characters in the Les Miserables musical. We
find that the shape and general features are similar to 
the Zachary modularity density surface. The main dif- 
ferences are that the Les Miserables surface shows smaller 
regions of negative modularity and a more shallow cur- 
vature. In these two examples, we find that the gen- 
eral features of the modularity density are very similar. 
Therefore we will investigate the modularity density of a 
few artificial networks to see whether the same general 
features can be found. 

FIG. 2: (Color online) Surface and contour plots showing the
distribution of random dendrograms for the Zachary Karate 
Club friendship network. The number of dendrograms in this 
figure is 100,000 and the modularity bin size is 0.01. On the 
x-axis is the number of communities, C. 



2. Random networks 

In Fig. 4 we show the modularity density for a random
network with 34 nodes and 78 undirected edges, which
gives the same average degree (⟨k⟩ = 4.59) as the Zachary
Karate Club. The edges are randomly distributed, however,
and the network does not exhibit any particular
community structure and certainly not the same community
structure as the Zachary network. The modularity
density, on the other hand, does exhibit a structure very
similar to what we found for the networks of Figs. 2
and 3. This indicates that to a large extent the structure
of the modularity density is not associated with the particular
community structure, but rather with the network
itself.

FIG. 3: (Color online) Surface and contour plots showing the
distribution of random dendrograms for the Les Miserables
network. The number of dendrograms in this figure is 100,000
and the modularity bin size is 0.01. On the x-axis is the
number of communities, C.

In Fig. 5 we show the modularity density for a random
network with 34 nodes and 78 undirected edges. Before
distributing the edges, the nodes were randomly assigned
to 2 communities. The edge distribution was random
with the constraint that the probability p_in of connecting
two nodes within the same community was chosen to be
10 times larger than the probability p_out of connecting
two nodes which are in different communities (see Ref. [11] or
Sec. III B 1 for a more detailed description of the p_in/p_out
algorithm). The generated network is shown in Fig. 6,
and it shows a clear community structure.
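A network of this kind can be generated in a few lines. The sketch below is one possible realisation and not necessarily the procedure used for the figures: it assumes that exactly 78 distinct node pairs are drawn without replacement, with same-community pairs given a sampling weight 10 times that of different-community pairs. The function name and the seed are hypothetical.

```python
import itertools

import numpy as np


def planted_network(n_nodes, n_edges, n_communities, ratio, rng):
    """Random network with intra-community pairs favoured by `ratio`.

    Assumption: edges are drawn without replacement from all node pairs,
    with relative weight `ratio` for pairs inside one community and 1
    for pairs spanning two communities.
    """
    labels = rng.integers(n_communities, size=n_nodes)   # random assignment
    pairs = np.array(list(itertools.combinations(range(n_nodes), 2)))
    same = labels[pairs[:, 0]] == labels[pairs[:, 1]]
    weights = np.where(same, ratio, 1.0)
    chosen = rng.choice(len(pairs), size=n_edges, replace=False,
                        p=weights / weights.sum())
    A = np.zeros((n_nodes, n_nodes))
    for i, j in pairs[chosen]:
        A[i, j] = A[j, i] = 1.0
    return A, labels


A, labels = planted_network(34, 78, 2, 10.0, np.random.default_rng(1))
print(A.sum() / 34)    # average degree, close to 4.59
```

Fixing the number of edges rather than the connection probabilities keeps the average degree identical to that of the Zachary Karate Club, which makes the densities directly comparable.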



3. Fully connected network 

FIG. 4: (Color online) Surface and contour plots of random
dendrograms for a random network with 34 nodes and
the same number of edges (but randomly distributed) as the
Zachary Karate Club. On the x-axis is the number of
communities, C.

Fig. 7 shows our results for a fully connected network
of 34 nodes. The overall structure of the modularity density
is markedly different from the other cases, and we
attribute this difference to the much higher level of
connectedness of the nodes. In this case the modularity is never
positive, and the maximum value is at C = 1, where
Q = 0. The modularity is negative because
the number of inter-community edges is always
larger than the number of intra-community edges
for this network. The off-diagonal elements of the
assortative mixing matrix e are therefore always large and
contribute strongly to the negative term in the expression
for the modularity, Eq. (3). The average degree in this fully
connected network is ⟨k⟩ = 33, as compared to ⟨k⟩ = 4.59
for the Zachary network.
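For the fully connected network the sign of the modularity can also be checked analytically. The short derivation below is a sketch that assumes an unweighted complete graph without self-edges, divided into C communities of sizes n_1, ..., n_C (these symbols are introduced here for illustration), and uses the definition Q = Tr(e) - sum_ij (e^2)_ij.

```latex
% Complete graph on N nodes, communities of sizes n_1, ..., n_C.
\begin{align*}
e_{ii} &= \frac{n_i (n_i - 1)}{N(N-1)}, \qquad
e_{ij} = \frac{n_i n_j}{N(N-1)} \quad (i \neq j), \\
\sum_j e_{ij} &= \frac{n_i}{N}, \qquad
\sum_{ij} (e^2)_{ij} = \sum_i \Big( \sum_j e_{ij} \Big)^2
  = \frac{1}{N^2} \sum_i n_i^2, \\
Q &= \frac{\sum_i n_i^2 - N}{N(N-1)} - \frac{\sum_i n_i^2}{N^2}
   = \frac{\sum_i n_i^2 - N^2}{N^2 (N-1)} \le 0 .
\end{align*}
```

Since sum_i n_i^2 = N^2 only when all nodes are in a single community, equality holds only for C = 1, in agreement with the maximum Q = 0 found numerically.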

4. Symmetric network

In Fig. 8 we show the modularity density of a completely
symmetric regular network consisting of 64 nodes
arranged in an 8 x 8 grid with 4 edges per node and
periodic boundary conditions. Intuitively, this network
should not show any strong community structure. But,
as shown in the figure, there is a high-modularity region
around C ≈ 8 communities.



B. Analysis 

1. General networks 

The appearance of the modularity density is very similar
for different networks, and in particular the shape of
the most probable Q(C) region, the "ridge" in the density
surface plots, shows remarkable similarity between
different networks.

FIG. 5: (Color online) Modularity density for a random network
with 34 nodes and 78 edges. The nodes were randomly
assigned to 2 communities and the edges were randomly
distributed under the constraint that the probability of
connecting two nodes within the same community is 10 times larger
than the probability of connecting two nodes which are in
different communities. On the x-axis is the number of
communities, C.

FIG. 6: (Color online) Network structure of a random network
with 34 nodes, ⟨k⟩ = 4.59 and p_in/p_out = 10.

FIG. 7: (Color online) Modularity density for a fully connected
network with 34 nodes. On the x-axis is the number
of communities, C.

In an effort to describe the general
shape of the Q(C) ridge we will employ a simple model
of the modularity as a function of the number of communities.
We start by observing that in order to maximize
the modularity it is desirable to minimize the
off-diagonal elements and evenly distribute the diagonal
elements in the assortative mixing matrix e. In the case
where the communities are completely disconnected, the
maximum modularity is given by

$$
Q = 1 - \frac{1}{C}, \tag{4}
$$

where the first term is due to the trace of e and the second
is due to the Σ_ij (e^2)_ij term of Eq. (3). This formula
is the upper theoretical limit of the modularity for any
network but does not describe the general shape of the
modularity density very well. In order to correct
this formula for networks that are connected, we
introduce the function δ(C), which represents the average
number of edges connecting a community with the other
communities. For simplicity we will only consider cases
where the communities have equal size and the edges between
the communities are distributed evenly.

FIG. 8: (Color online) Modularity density for a regular network
of 8 x 8 nodes on a two-dimensional square grid with
periodic boundary conditions. On the x-axis is the number
of communities, C.

The C x C assortative mixing matrix will then look like



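$$
e_C = \frac{1}{M}
\begin{pmatrix}
\frac{M}{C} - \delta(C) & \cdots & \frac{\delta(C)}{C-1} \\
\vdots & \ddots & \vdots \\
\frac{\delta(C)}{C-1} & \cdots & \frac{M}{C} - \delta(C)
\end{pmatrix}, \tag{5}
$$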



where M is the total number of directed edges in the 
network. The modularity in this case is given by 



$$
Q = 1 - C \, \frac{\delta(C)}{M} - \frac{1}{C}. \tag{6}
$$
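The step from the mixing matrix to the modularity can be written out explicitly. The sketch below assumes that every diagonal element of the C x C mixing matrix equals (M/C - δ(C))/M and every off-diagonal element equals δ(C)/((C-1)M), as implied by equal community sizes and evenly distributed inter-community edges.

```latex
\begin{align*}
\mathrm{Tr}(e_C) &= C \cdot \frac{1}{M}\Big(\frac{M}{C} - \delta(C)\Big)
                  = 1 - \frac{C\,\delta(C)}{M}, \\
\sum_j (e_C)_{ij} &= \frac{1}{M}\Big(\frac{M}{C} - \delta(C)
                     + (C-1)\,\frac{\delta(C)}{C-1}\Big) = \frac{1}{C}, \\
\sum_{ij} (e_C^2)_{ij} &= \sum_i \Big(\sum_j (e_C)_{ij}\Big)^2
                        = C \cdot \frac{1}{C^2} = \frac{1}{C}, \\
Q &= \mathrm{Tr}(e_C) - \sum_{ij} (e_C^2)_{ij}
   = 1 - \frac{C\,\delta(C)}{M} - \frac{1}{C}.
\end{align*}
```

Setting δ(C) = 0 recovers the limit of completely disconnected communities, Q = 1 - 1/C.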



In order to make a reasonable estimate of δ(C) we will
revisit the p_in/p_out model already used in Section III A 2.
The model, as introduced by Newman and Girvan, is
used to generate networks with a predetermined community
structure by randomly choosing pairs of nodes and
connecting them based on the two probabilities p_in and
p_out. The ratio of the probabilities determines the extent
of community formation in the network. In addition, the
values of the probabilities are chosen such that the average
degree per node, ⟨k⟩, can be controlled. Since we are
interested in deriving a simple closed-form expression for
δ(C), we approximate the probability of finding a particular
pair of nodes by assuming that the network is empty
and does not contain any edges. Our approximation will
be good for sparsely connected networks and few communities,
but will become progressively worse for C → N
and ⟨k⟩ → N - 1. In a network without self-edges and
with C communities we therefore require that



$$
\left(\frac{N}{C} - 1\right) p_{\mathrm{in}} + N\left(1 - \frac{1}{C}\right) p_{\mathrm{out}} = \langle k \rangle. \tag{7}
$$



As a shorthand we introduce λ = p_in/p_out ≥ 1, a freely
tunable parameter. The first term in Eq. (7) corresponds
to the average number of edges per node connecting two
nodes within the same community, ⟨k_in⟩. The second
term is the corresponding number of edges connecting
two nodes in different communities, ⟨k_out⟩. It is this second
term that we need in order to estimate the parameter
δ(C). The parameter δ(C) is given by



$$
\delta(C) = \frac{N}{C} \langle k_{\mathrm{out}} \rangle = \frac{N}{C} \langle k \rangle \frac{C - 1}{(1 - C/N)\lambda + (C - 1)}. \tag{8}
$$
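The intermediate algebra leading to δ(C) is short. The sketch below assumes the degree constraint (N/C - 1) p_in + N(1 - 1/C) p_out = ⟨k⟩ with λ = p_in/p_out, and reads the prefactor N/C as the number of nodes per community, so that δ(C) = (N/C)⟨k_out⟩.

```latex
\begin{align*}
p_{\mathrm{out}} &= \frac{\langle k \rangle}
   {\left(\frac{N}{C} - 1\right)\lambda + N\left(1 - \frac{1}{C}\right)}, \\
\langle k_{\mathrm{out}} \rangle &= N\left(1 - \frac{1}{C}\right) p_{\mathrm{out}}
   = \langle k \rangle \,
     \frac{C - 1}{\left(1 - \frac{C}{N}\right)\lambda + (C - 1)}, \\
\delta(C) &= \frac{N}{C}\, \langle k_{\mathrm{out}} \rangle .
\end{align*}
```

The second line follows by multiplying numerator and denominator by C/N.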



The fully-connected-like network can now easily be derived
by setting λ = 1, which gives

$$
\delta(C) = \langle k \rangle \, \frac{N^2}{N-1} \, \frac{C-1}{C^2}. \tag{9}
$$
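The model is easy to evaluate numerically. The following sketch assumes δ(C) = (N/C)⟨k⟩(C-1)/((1-C/N)λ + C-1), Q(C) = 1 - Cδ(C)/M - 1/C, and M = N⟨k⟩ for the total number of directed edges; the function names are hypothetical and the parameters are those quoted for the Zachary Karate Club.

```python
def delta(c, n, k_avg, lam):
    """Average number of edges linking one community to the others."""
    return (n / c) * k_avg * (c - 1) / ((1 - c / n) * lam + (c - 1))


def model_modularity(c, n, k_avg, lam):
    """Q(C) = 1 - C*delta(C)/M - 1/C, assuming M = N*<k> directed edges."""
    m = n * k_avg
    return 1.0 - c * delta(c, n, k_avg, lam) / m - 1.0 / c


N, K_AVG = 34, 4.59

# One Q(C) curve per value of lambda, for C = 1 ... N.
curves = {lam: [model_modularity(c, N, K_AVG, lam) for c in range(1, N + 1)]
          for lam in (1, 2, 4, 6)}

# Number of communities at which each curve has its maximum.
peaks = {lam: 1 + max(range(N), key=q.__getitem__)
         for lam, q in curves.items()}
print(peaks)
```

In this form ⟨k⟩ cancels out of Q(C), and for λ = 1 the expression reduces to Q = -(C-1)/(C(N-1)), which is never positive and vanishes only at C = 1.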



We chose N = 34 and ⟨k⟩ = 4.59 to model the Zachary
Karate Club; our results for a range of values of λ > 1 and
for λ = 1 are shown in the upper and lower panels of Fig. 9,
respectively. As can be seen from the figure, the model
reproduces the shape of the most likely modularity, the
modularity density ridge, well for the networks shown in
Figs. 2-5 and 8. In particular, the general feature of
a peak in modularity at a small value of C is reproduced
solely by the introduction of the δ(C) term that describes
the average number of edges that connect communities.
We observe that any connected network can be approximated
by an appropriate choice of δ(C). This suggests
that any connected network will show a peak in the modularity
density for small C ≪ N, just as our numerical
results for real networks indicate. Note that the model is
such that one needs to generate a new network for each
value of C in order to keep λ constant. Consequently, the
modularity plots in Fig. 9 are really to be understood as
a collection of modularity values for different networks
given the constraint that λ is fixed. It is interesting to
note also that the fully connected network, shown in Fig.
7, is qualitatively reproduced by our model in the case of
λ = 1, although ⟨k⟩ is far from the fully connected value.
In our model, ⟨k⟩ represents only a scaling constant and
does not alter the qualitative result. We cannot expect
to get quantitative agreement, since the model only accounts
for the general behavior of the C-dependence
of the modularity with fixed λ. We note that it is in
principle possible to write down the C-dependence of λ
for a regular lattice, but that has not been done in
the current study. However, the limitations of the model
do not affect our general conclusion that the maximum
in the modularity occurs for relatively small values of C
for any connected network.

FIG. 9: (Color online) Modularity plots for the p_in/p_out
model. The upper panel shows λ = 2, 4, 6 and the lower
panel λ = 1. N = 34 and ⟨k⟩ = 4.59, which corresponds to
the Zachary Karate Club. On the x-axis is the number of
communities, C.

IV. CONCLUSIONS 

We have presented a matrix formalism to describe the
modularity of a community division in a network. We
have described the modularity for some well-studied networks
as well as some synthetic networks from a statistical
point of view and introduced the concept of modularity
density.



In conclusion we found that the modularity density is 
quite similar for different networks. Even random net- 
works with no apparent community structure exhibit a 
remarkably similar modularity density. This suggests 
that most of the structure of the modularity density is 
independent of the network itself. We have introduced 
a simple model that describes the general shape of the 
modularity density based on the p_in/p_out concept of Newman
and Girvan [11] and concluded that any connected
network must show a peak in the modularity density at 
a small number of communities compared to the size of 
the network. 



The presence of a general shape indicates that it 
should be possible to develop global optimization strate- 
gies which work well for most networks. The maximum 
modularity curve of course depends on the particular net- 
work. 